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Abstract 

Because of methodological breakthroughs and the availability of an increasing amount of whole-genome sequence data, horizontal 
transfers (HTs) in eukaryotes have received much attention recently. Contrary to similar analyses in prokaryotes, most studies in 
eukaryotes usually investigate particular sequences corresponding to transposable elements (TEs), neglecting the other components 
of the genome. We present a new methodological framework for the genome-wide detection of all putative horizontally transferred 
sequences between two species that requires no prior knowledge of the transferred sequences. This method provides a broader 
picture of HTs in eukaryotes by fully exploiting complete-genome sequence data. In contrast to previous genome-wide approaches, 
we used a well-defined statistical framework to control for the number of false positives in the results, and we propose two new 
validation procedures to control for confounding factors. The first validation procedure relies on a comparative analysis with other 
species of the phytogeny to validate HTs for the nonrepeated sequences detected, whereas the second one built upon the study of the 
dynamics of the detected TEs. We applied our method to two closely related Drosophila species, Drosophila melanogaster and 
D. simulans, in which we discovered 1 0 new HTs in addition to all the HTs previously detected in different studies, which underscores 
our method's high sensitivity and specificity. Our results favor the hypothesis of multiple independent HTs of TEs while unraveling a 
small portion of the network of HTs in the Drosophila phytogeny. 
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Introduction 

Thanks to next-generation sequencing (NGS) technologies 
and to recent advances in de novo genome-assembly algo- 
rithms, we now have access to an increasing number of com- 
plete eukaryotic genomes. This methodological shift toward 
deep sequencing has changed the scale of investigation for 
many genomic studies and now allows the study of horizontal 
transfers (HTs) between eukaryotic species (Gilbert etal. 2010, 
2013; Gilbert and Cordaux 2013). 

HTs are defined by an exchange of genetic material be- 
tween two reproductively isolated organisms (Gilbert et al. 
2009) or by a movement of genetic information across 
normal mating barriers between more or less distantly related 
organisms (Keeling and Palmer 2008). Contrary to prokary- 
otes, for which HTs are common and well described (Fall et al. 
2007; Juhas et al. 2009; Weinert et al. 2009), HTs are thought 
to be rare in eukaryotes, and their underlying mechanisms 
remain unknown (Andersson 2005). Proposed hypotheses to 
explain HTs in eukaryotes range from virus-mediated HTs 
using direct transfer of episomes (O'Brochta et al. 2009), 



viral particles, or infection (Kim et al. 1994; Dupuy et al. 
2011) to parasite-mediated transfers (Gilbert et al. 2010). 
Overall, the main difference between eukaryotes and pro- 
karyotes regarding HTs resides in the type of DNA material 
that is transferred: HTs usually involve genes in prokaryotes 
(Ochman et al. 2000), whereas in eukaryotes, HTs usually in- 
volve noncoding DNA and transposable elements (TEs) 
(Schaack et al. 2010). Following the availability of complete 
assembled genomes, more attention has been directed to 
the detection of HTs in eukaryotes, but most studies rely on 
similar approaches to the ones used for the detection of HTs in 
prokaryotes (Doyon et al. 201 1). However, the differences in 
the type of horizontally transferred sequences between pro- 
karyotes and eukaryotes and in the quantity of DNA to be 
investigated raise specific methodological challenges that 
need to be addressed to obtain a broader picture of 
genome-wide HT dynamics in eukaryotes (de Carvalho and 
Loreto 2012). 

A particularity of the detection of HTs in eukaryotes is that it 
first requires the genome-wide identification of candidate 
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pairs of sequences that are not necessarily predefined. This 
point has motivated the development of several approaches, 
such as the surrogate method, which relies on differences in 
nucleotide patterns consistent with foreign DNA(Ragan 2001; 
Putonti et al. 2006). Nevertheless, this type of approach dis- 
plays such a high rate of false detections that it is not efficient 
for real case studies (Azad and Lawrence 2011). Other 
genome-wide approaches start with all-to-all Blast searches 
between genomes of many species and detect HTs using an 
arbitrary cutoff using e-values (Shi et al. 2005) or a lineage 
probability index (Podell and Gaasterland 2007). However, 
although these strategies have given better results than the 
surrogate method, the lack of statistical framework for the 
detection of HTs in both methods has limited the interpreta- 
tion of their results and the precise assessment of their speci- 
ficity and sensitivity (de Carvalho and Loreto 2012). Overall, 
the promises of genome-wide approaches have been tem- 
pered by these common drawbacks, which explain the prev- 
alence of sequence-specific approaches in HT studies even for 
genome-wide data sets. 

When focusing on sequence-specific approaches to study 
HTs, we can discriminate between tree-topology-based 
approaches and sequence-divergence-based approaches. In 
prokaryotes, the gold standard for detecting HTs relies on 
the study of incongruences between the phytogeny of the 
sequences undergoing HTs and the phytogeny of the species. 
Because the pairwise identity of a horizontally transferred se- 
quence is higher than expected according to the divergence 
time of the two species (Silva et al. 2004; Loreto et al. 2008), a 
phylogenetic tree based on this sequence will be discordant 
from the species tree. Unfortunately, phylogenetic approaches 
require a large taxonomic sampling of genes to have sufficient 
power of detection, which is often lacking in eukaryotes. 
Moreover, this method poorly differentiates HT genes from 
ancestral gene duplication(s) followed by gene loss(es) (Roger 
1999). Phylogenic incongruences can also be produced when 
two or more variants of the ancestral lineage sequence have 
been stochastically inherited by the derived lineages (Dias and 
Carareto 2012). Finally, another pitfall of these approaches is 
the possibility of phylogenetic reconstruction artifacts, which 
can lead to strongly supported but false trees and thus to false 
positives for HT detection. 

Studying pairwise sequence divergences constitutes an al- 
ternative that is commonly used when working with eukary- 
otes. It can rely on different divergence metrics, such as the 
synonymous substitution rates (d s or K s ), to test the consis- 
tency of the number of synonymous differences accumulated 
between two sequences with the divergence time between 
the two species. Confounding factors can also decrease the 
power of c/ s -based approaches. Codon usage bias, for in- 
stance, can result in a reduced d s for the reference genes, 
which can decrease the sensitivity of detection of sequences 
with low d s (Wallau et al. 2012). Purifying selection and var- 
iable rates of sequence evolution can also lead to spurious HT 



detections or a lack of power for identity-based methods 
(Capy et al. 1994; Pace et al. 2008). Finally, a third line of 
evidence for the detection of HTs is a patchy distribution of 
the sequences within a group of taxa (as they are not vertically 
transmitted). However, because of stochastic losses, the lack 
of coverage of some parts of the genomes and the random 
sampling of the population alleles in the sequenced strains, 
this third line of evidence is hardly self-sufficient to infer an HT 
event (Keeling and Palmer 2008; Schaack et al. 2010). 

One strategy to control for spurious HT detections has been 
to focus on one line of evidence for the detection of HTs and 
to rely on the two others for validation purposes (Loreto et al. 
2008; Gilbert et al. 2010). However, when dealing with 
eukaryotes, the absence of evidence for phylogenetic incon- 
gruences and the absence of a patchy distribution are likely to 
be poor validating arguments, as they do not constitute strict 
evidence against the possibility of an HT (Wallau et al. 2012). 
Another weakness of current sequence-specific approaches is 
that both tree-topology- and sequence-divergence-based 
approaches are restricted to coding sequences (CDSs). This 
represents only a small part of most eukaryote genomes 
and introduces an important detection bias for the analysis 
of horizontally transferred sequences. 

In eukaryotes, for which HT events involve noncoding DNA 
and TEs, only 330 cases of horizontally transferred TEs have 
been described to date (Wallau et al. 2012) compared with 
rates as high as 30% of lateral gene transfers per phylogenetic 
branches for prokaryotes (Abby et al. 2012). TEs are DNA 
segments that are able to replicate and insert themselves 
into the genome using different mechanisms (Finnegan 
1997; Wicker et al. 2007; Jurka et al. 201 1). One of the out- 
standing features of TEs is their ability to cross species bound- 
aries and invade new genomes (Daniels et al. 1990; Pinsker 
et al. 2001 ; Ludwig et al. 2008). These elements can represent 
the most abundant part of large eukaryotic genomes, as is the 
case of the maize genome (85%) (Schnable et al. 2009) and of 
the human genome (between 45% and 78% according to 
the detection method [Lander et al. 2001; de Koning et al. 
2011]). 

Notably, among the 330 horizontally transferred TEs 
detected, 178 concern drosophilid species, and from the 
101 putative HT events proposed in Drosophilae in 2008, 
only 15% were confirmed by the three lines of evidence we 
have mentioned (Loreto et al. 2008). Regardless of this 
overrepresentation of drosophilids, the majority of these 330 
HT detections relied on sequence-specific studies of candidate 
sequences. With this approach, only a small part of the 
genomes is exploited, which leads to an underestimation of 
the number of HTs. Our proposed genome-wide approach 
aims to solve this bias by requiring no prior knowledge 
concerning the sequences of interest and evaluating all the 
identifiable pairs of sequences between two genomes with an 
identity-based approach. Our method addresses the detection 
of all HTs genome wide as a multiple-testing problem to 



Genome Biol. Evol. 6(2):41 6^32. doi:10.1093/gbe/evu026 Advance Access publication February 3, 2014 



417 



Modolo etal. 



GBE 



handle this large number of identity-based detections and to 
control the proportion of false positives in the results (Wei 
et al. 2009). We also propose two new filtering methods to 
sort out spurious HT detections corresponding to conserved 
sequences in the results. 

We applied our method to the genome-wide detection of 
all putative HT sequences between two Drosophila species: 
Drosophila melanogaster and D. simulans. These two cosmo- 
politan Drosophila species have a divergence time estimated 
between 4.3 and 6.5 Myr (Tamura et al. 2004) and are highly 
similar on many points, except in their TE content. TEs in 
D. melanogaster represent a large amount of the genome 
(15% [Dowsett and Young 1982]), with mainly young and 
active (highly similar) copies (Bowen and McDonald 2001; 
Kaminker et al. 2002; Lerat et al. 2003). In contrast, the TEs 
in D. simulans are represented mainly by old and degraded 
copies (Lerat et al. 201 1) and only account for 6.85% of the 
genome (Hu et al. 2013). To explain the differences in the TE 
landscape between these two species, previous studies based 
on a restricted number of TEs have shown that numerous HTs 
were likely to be involved (Bartolome et al. 2009; Lerat et al. 
201 1) (see Carareto [201 1] for a review). To obtain a broader 
picture of HT between these two genomes, we performed a 
whole-genome comparison study between D. melanogaster 
and D. simulans assuming that undefined fragments of DNA 
may have been transferred from one species to the other. 
These undefined fragments of DNA can contain any types 
of sequences, such as TEs, nuclear genes, or intergenic 
DNA, thus removing any detection bias toward CDSs. As a 
result, we detected 10 new putative horizontally transferred 
TEs in addition to all the horizontally transferred TEs described 
by different studies between D. melanogaster and D. simu- 
lans, bringing to light a portion of the rich network of HTs that 
seems to link together the Drosophila species. 

Materials and Methods 

Our method can be divided into two main parts. For the first 
part, it relies on a multiple-testing framework to identify with a 
high sensitivity all the sequences that may have been horizon- 
tally transferred between two species at the genome scale. 
This approach is divided into three different steps described 
later. Then, we developed a multiple-testing framework to 
evaluate the output of multiple identity-based detections of 
HTs while controlling for the expected proportion of false pos- 
itives in the results. A novelty of our approach is the modeling 
of the data throughout the genome as candidate sequences 
that are structured spatially, accounting for their dependency 
structure with a nonhomogeneous Markov model (NHMM) to 
increase the power of the multiple-testing correction (Kuan 
and Chiang 2012). For the second part of our method, we 
discriminate between putative HTs and other mechanisms, 
leading to a high pairwise identity to increase our specificity. 
For this purpose, we propose two novel validation procedures 



that can be applied for genome-wide studies to control for the 
numerous sources of spurious detections inherent to the de- 
tection of HT. 

We will thereafter introduce the software, the algorithms, 
and the statistical models that we used for the different parts 
of this approach (supplementary fig. S1, Supplementary 
Material online). In our application, genome A corresponds 
to the genome of D. melanogaster and genome B to the 
genome of D. simulans. 

Description of the Tree Steps for the Detection of 
Putative Horizontally Transferred Sequences between 
Two Genomes A and B 

Step 1: Selection of the Sequences of Interest 

To identify HT events, we define a sequence of interest as part 
of a pair of sequences with a higher pairwise nucleotidic iden- 
tity than expected between the two species A and B. This first 
part of the pipeline aims to delimit such sequences in the two 
genomes. To achieve this goal, we start by retrieving the list of 
all the identifiable pairs of sequences between the two species 
A and B. For this step, we performed a nucleotidic all-to-all 
Blast (version 2.2.26) of one genome against the other 
(Altschul et al. 1990). The output of such a Blast defines a 
many-to-many cardinality between sequences from the two 
species, meaning that a given sequence from one species 
can be linked to many sequences in the other species, and 
vice versa. These types of links are complex and represent a 
large quantity of data to address. Moreover, as we cannot 
observe two different horizontally transferred sequences at 
the same locus in the species A, we filter the resulting pairs 
of sequences to only retain the best match for each position 
of the genome of A. For the task at hand, we only need the 
best local alignments of sequences for each position along the 
genomes because the other alignments would have a lower 
identity and thus a lower probability to correspond to an HT 
event. 

To parse the Blast output and obtain a one-to-one cardi- 
nality from a many-to-many cardinality, we developed in 
python the program htdetect.py (available from the 
online resources). This program uses the fact that when work- 
ing on two different genomes, there is always a genome of 
better quality (genome A) than the other genome (genome B). 
Our algorithm can be divided into the four following stages 
(fig. D: 

1 . Compute the identity between each pair of sequences and 
the corresponding P-values to account for the identity and 
the size of the pair of sequences (see unilateral binomial 
test later) (fig. 1,4). 

2. Order all the pairs of sequences according to their position 
in genome A (fig. 16). 

3. Merge all the overlapping pairs of sequences in genome A 
to obtain a one-to-many cardinality from the many-to- 
many cardinality (fig. 1 Q. 
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Fig. 1. — Algorithm to reduce the many-to-many cardinality in the 
results of an all-to-all nucleotidic Blast to a one-to-one cardinality between 
a genome A (red) and a genome B (blue). (4) Compute the identity be- 
tween each pair of sequences and the corresponding P values (see 
Materials and methods, stage 3), and order all the pairs of sequences 
according to their position on genome A (the sequence order is 1-2-3). 
(6) Merge all the overlapping pairs of sequences in the genome A to go 
from a many-to-many cardinality to a one-to-many cardinality (remove the 
dashed part of the sequence 3. 1 ). (0 Keep the sequences with the lowest 
P values from genome B for each pair of sequences that were merged in 
stage 3 to obtain a one-to-one cardinality (remove the dashed sequence 
2.2). (D) One-to-one cardinality between the two genomes. 

4. Keep the sequences with the lowest P values from genome 
B for each pair of sequences that have been merged in step 
3 to obtain a one-to-one cardinality (fig. 1Cand D). 

In the hypothetical case where both genomes are of equiv- 
alent quality, the above steps will be strictly symmetrical to 
obtain a one-to-one cardinality. 

Step 2: Computation of the Expected Pairwise Identity 
between the Compared Species 

To test Ho : "the number of differences is greater than or 
equal to the expected number of differences," for each of 
the filtered sequences, we compute the expected pairwise 
nucleotide identity between the species A and B, given their 
time of divergence. For this purpose, we used a global pair- 
wise alignment of the genome of the species B against the 
genome of species A. We compute the number of identical 
nucleotides for each nonoverlapping window of size 1 kb 
along each chromosome arm of the species A. The size of 
1 kb was empirically chosen as a trade-off between the reso- 
lution for the identity computation (of 0.01%) and informa- 
tion about the identity variation (a large window size only 
gives access to the average identity). To compute the nucleo- 
tide identity percentage between the species A and B for each 
of these windows, we removed the unknown nucleotides and 
the gaps from the computation. 

We then used a Gaussian kernel smoothing function of 
these nonoverlapping window identity scores to obtain the 



distribution of the nucleotide identity between the two spe- 
cies. As this identity distribution is skewed to the right in the 
case of our application to Drosophila species (fig. 2), we chose 
to use the highest mode of this distribution as the expected 
pairwise identity between the two genomes, instead of the 
mean or a given quantile. 

Step 3: Test of the Sequence Pairwise Identity 

To model the pairwise identity, for every pair of sequences n, 
we denote by W n the number of different nucleotides 
between the two sequences. The distribution of W n is 
B(L n ,p n ), where L n is the length of the pair of sequences of 
interest and p„ is the probability of having a nucleotidic dis- 
similarity. Our aim is to test H Q : [p n < p 0 } accounting for L n , 
in which 1 - po is the expected identity calibrated using the 
reference distribution constructed from the global alignment 
of the two genomes (=95.62% for our application). Thus, we 
compute for each pair of sequences n the probability 
P(w° bs ) = P{W n > w° bs ] of having a number of different 
nucleotides lower than expected, or unilateral P-value. 

The number of tests N equals the number of candidate 
pairs of sequences for each chromosome arm and for the 
whole genome. Thus, for a given level of type I error (e.g., 
a = 0.05), with a crude estimate under independence of the 
tests, the number of false positives (N x a) can be larger than 
the number of positives. 

At each position along the genome of species A, we have a 
P-value denoted by P(w n ) that is distributed according to a 
uniform distribution in [0,1] under H 0 . From this P value, we 
want to infer an indicator variable denoted by S n , such that 
S n = 1 if H 0 is rejected at position n and S n = 0 otherwise. To 
proceed, we use the local false discovery rate (IFDR) strategy, 
which consists in assessing the posterior probability that S n is 
under H 0 (Efron et al. 2001). Instead of using raw P-values, a 
standard strategy consists in using the inverse probit trans- 
form, such that z n = 4> _1 (P(w n )), which results in centered 
standard Gaussian variables for the z under H 0l whereas the 
others follow an unknown density distribution f v Then, the 
posterior probability of being under H 0 is £FDR n = P(S n = 
1 \z n ). The decision rule consists in selecting positions 

n = 1 , . . . ,£, such that I = maxj/ : (1 //') £j =1 IFDR, < aj, 

where IFDR] , IFDR N is ordered and a is the false discov- 
ery rate (FDR) level (Benjamini and Yekutieli 2001). 

By mapping candidate sequences along the genome of 
species A, we expect the probability for one locus to have a 
higher pairwise nucleotide identity than expected to depend 
on its neighbors. Moreover, with the fragmentation of the 
candidate sequences due to the nucleotidic Blast, we also 
could detect small adjacent pieces of this locus instead of a 
unique DNA fragment, and because of their small sizes, 
each of these pieces of alignment could be statistically 
nonsignificant on its own. In the case of dependency, all the 
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Fig. 2. — Distribution of the pairwise nucleotide identity, genome wide 
with nonoverlapping windows of size 1 kb, of the Drosophila simulans 
genome alignment on the D. melanogaster genome. The vertical bars 
represent the values of the mean, the median, and the mode of this 
distribution. 

multiple-testing procedures not accounting for the depen- 
dency structure are suboptimal (Wei et al. 2009), meaning 
that if the procedure controls for the FDR for a given level, it 
does not minimize the false nondiscovery rate. Because 
decision at position n may depend on neighbor tests, we 
used the local index of significance (US) to compute 
P(S n = 0|zi, ...,z N ) (Sun and Tony Cai 2009). 

To proceed, we considered a homogeneous hidden 
Markov model in which S n is the hidden states (5„ e {0, 1}), 
which is governed by transition probabilities P(S n+ i \S n ). 
Moreover, we also accounted for the genomic context of 
each sequences, like GC content or the distance between 
the sequences that can influence the transition and emission 
probability of the model, as we do not expect the dependency 
of a given sequence to its neighbors to be the same between 
every sequences. We considered a logistic regression to 
account for covariates X-\,...,X N characterizing the se- 
quences, such that: 



P(S, =j\X : = x) 



exp 



(Xj+tf x x) 



£ exp(x k +p n k x x) 



k=0 



exp(o-j+p" x x) 
P(S n = y|S n _i = i,X n = x) = ^ '— 

E exp{°ik+P n k x x) 



Ar=0 



with ij = {0, 1} (Kuan and Chiang 2012). The model param- 
eters * = (k, i\ , A.,-, Ojj, pj), with k being the proportion of 



P-values equal to 1 , can be estimated using the EM algorithm. 
We developed a zero-inflated Gaussian distribution to handle 
unilateral tests with the appropriate z-values transformed. This 
model is implemented in the r package f drDEP available on 
the cran for multiple unilateral hypothesis testing. 

The LIS statistics are computed for each chromosome arm 
of the species A and concatenated to control for the FDR at a 
level of 10% for the whole genome of A with the Benjamini, 
Hochberg, and Yekutieli procedure (Wei et al. 2009). 

Filtering for True Putative HT Events 

With steps 1-3, we could have detected highly similar 
fragments of sequence alignments that would not have 
been significant for the whole corresponding sequences, so 
we first recovered the full length of each annotated DNA 
fragment detected in the species A. To reconstruct the com- 
plete sequences for these results, we used the bedtoois 
suite (version 2.17.0, options intersectBed -a annota 
tions.gff -b results. bed -wa) (Quinlan and Hall 

2010) to extract the annotated sequences corresponding to 
results with positions intersecting the ones from the species A. 
Then, we applied the two following filters to sort out con- 
served sequences from our results for nonrepeated and 
repeated sequences. 

For Nonrepeated Sequences 

For CDSs, we expect to observe an effect of selection because 
nonsynonymous mutations can be deleterious, neutral, or ad- 
vantageous. Thus, for the CDSs identified with our approach, 
we can compute their d s values using orthologous genes. We 
then performed the same unilateral binomial test as for the 
nucleotidic identity to determine whether the d s of a given 
CDS is significantly lower than the expected identity between 
the two species considered while controlling for the FDR at a 
level of 10% (Benjamini and Yekutieli 2001). 

In addition, to take into account non-CDSs that cannot be 
used in d s approaches, we developed a new validation proce- 
dure based on sequence conservation, which can be applied 
to both coding and non-CDSs. In the set of detected se- 
quences, a sequence identified with the same level of signif- 
icance, both between D. melanogaster (the species A) and 
D. simulans (the species B) and between D. melanogaster 
and other Drosophila species, would illustrate a conserved 
sequence across the phytogeny rather than multiple HTs at 
the same position in D. melanogaster. Thus, we performed 
the same analysis with four other species from the 12 
Drosophila genomes project: D. sechellia, D. yakuba, D. pseu- 
doobscura, and D. virilis, as a gradient of phylogenetically 
divergent species, before subtracting these results from 
those of the D. melanogaster-D. simulans analysis. We used 
the bedtoois suite to subtract the. bed tracks of the results 
of each species along the D. melanogaster genome. Figure 3 
describes the decision rule used in this subtraction according 
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to the corresponding phylogenetic tree. This step provided us 
with a landscape of all the sequences with a pairwise identity 
higher than expected between D. simulans and D. melanoga- 
ster and not conserved in the other Drosophila species. This 
last filter relies on the strong hypothesis that a pair of 
sequences absent between a given pair of species is not miss- 
ing due to random sampling of the population alleles in the 
sequenced individuals, a lack of genome coverage, or a 
misassembly. 

As TEs and other repeated sequences are present at multi- 
ple loci between a pair of genomes, they were excluded from 
this filtering step and were validated separately, considering 
that we could not discriminate which TE copy identified be- 
tween two genomes corresponded to a specific locus in the 
genome of the species A. 



For Repeated Sequences 

With our genome-wide approach, the set of TEs detected was 
not restricted to elements with a coding capacity, preventing 
us from relying on the d s metric for their validation. Moreover, 
for TE family with a large number of copies, we can expect 
one or more of these copies to be more identical than ex- 
pected between the two genomes just by chance. To account 
for the full set of detected TEs and analyze each detected TE 
family, we developed a new validating procedure based on 
the recent dynamics of the detected TEs in the genomes of 
species A and B. We worked under the hypothesis that, after 
an HT, a TE escapes the host defense mechanisms for a time 
and quickly replicates itself in the new host genome 
(Anxolabehere et al. 1988; Le Rouzic and Capy 2005; 
Granzotto et al. 2011). Thus, in the case of an identifiable 
horizontally transferred TE, we expected to observe many 
highly similar copies of the TE corresponding to this burst of 
transposition in one or both genomes, in contrast to few 



Lr°. 

L D. 



I — D. melanogaster 
D. simulans 
sechellia 
D. yakuba 
D. pseudoobscura 
D. virilis 



B 



Fig. 3. — Decision rule for the filtering step about selective pressure, 
with the presence (red) or absence (gray) of a pair of sequences between 
the corresponding species. (A) Putative HT between Drosophila melano- 
gaster and D. simulans. (6) Putative HT between D. melanogaster and 
D. simulans prior to the D. sechellia speciation event or conserved se- 
quences between D. melanogaster, D. simulans, and D. sechellia. 
(0 Conserved sequences in the melanogaster subgroup. (D-£) 
Conserved sequences with stochastic loss or ancestral polymorphisms. 



conserved TE insertions (Lerat et al. 201 1; Dias and Carareto 
2012). 

To help in the synthetic interpretation of the recent history 
of each TE in our results, we start by defining the most iden- 
tical pair of copies between the two genomes, as the last 
putative horizontally transferred copy in the case of an HT. 
For each TE family, this most identical pair of copies between 
the two genomes is defined as the pair of copies with the low- 
est P-values from all the detected copies using the 80-80-80 
rule (Wicker et al. 2007). Then, we Blast each of these most 
identical copies on the genome of A using a nucleotidic 
Blast (version 2.2.26) (Altschul et al. 1990). We built an 
index of the similarity of each copy of these elements com- 
pared with the most identical pair of copies between the two 
genomes, normalized by the size of the copies. We called this 
index the activity track. These activity tracks are used to rank 
between 0 and 1 all the copies of each identified TE according 
to their divergence from the corresponding most identical pair 
of copies between the two genomes, with 1 corresponding to 
a low degree of divergence and a recent activity of this TE and 
0 corresponding to old and divergent copies. The activity track 
corresponds to the probability of having a pairwise nucleotidic 
identity with the most identical pair of copies less than or 
equal to the expected identity 1 - p 0 , estimated using the 
reference distribution constructed from the global alignment 
of the two genomes. For every pair of TE copies n, we denote 
by W n the number of different nucleotides between the two 
copies. The distribution of W n is B(L n ,p n ), where L n is the 
length of the alignment between the copies and p n is the 
nucleotidic dissimilarity. Our aim was to compute for each 
pair of sequences n the probability P{W n < w° bs ) corre- 
sponding to the activity track. The same analysis is performed 
with the genome of species B to get an overview of the TE 
activity in both genomes. We developed in python the pro- 
gram activity_tracks .py (available from the online re- 
sources) to compute this index. 

Finally, we manually inspected the results in . bed format on 
each chromosome arm of D. melanogaster to look for cluster 
of sequences with a higher identity than expected using the 

integrative genome viewer Software (Thorvaldsdottir 
etal. 2013). 

All the statistical analyses in this article were performed 
using the software r (version 3.0.0) (R Core Team 2013). 

Data Acquisition 

We used the last available versions of the genomes of 
D. melanogaster (species A) (version r5.49), D. sechellia (ver- 
sion r1 .3), D. yakuba (version r1 .3), D. pseudoobscura (version 
r2.30), and D. virilis (version r1 .2) and the corresponding an- 
notation tracks from flybase (http://flybase.org [Marygold 
et al. 2013]). For D. simulans (species B), we did not work at 
first on the genome sequenced by the 1 2 Drosophila genomes 
project (Drosophila 12 Genomes Consortium 2007). Indeed, 
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this genome is a patchwork of six independently derived 
strains with the assembly of six major chromosome arms rep- 
resenting only 101.3 Mb of the 137.8 Mb expected. 
Moreover, this genome presents several major misassemblies 
and has the worst read quality of the 12 Drosophila genomes 
(Hu et al. 201 3). This is why we used the D. simulans genome 
that was resequenced in 2012 and assembled from the w501 
strain of the original Sanger data, in addition to a high-cover- 
age lllumina sequencing of iso-females of this same strain (Hu 
et al. 2013). However, to be able to compare our approach 
with previous studies, we also conducted a second analysis 
with the genome of D. simulans (version 1 .3) available from 
flybase (http://flybase.org). 

The genome pairwise alignments were retrieved from the 
UCSC website (http://hgdownload.cse.ucsc.edu). 

The sequence annotation tracks used to obtain the full 
length of the corresponding TEs and CDSs and annotate the 
noncoding DNA were downloaded from flybase (http://fly 
base.org) in.gff format (Marygold et al. 2013). 

Instead of computing the d s of the detected CDSs, we used 
the d s data of the 1 1,000 orthologs from the 12 Drosophila 
genomes, available from the study of Heger and Ponting 
(2007). 

Quality of the TE Content in the Genome of D. simulans 

We used the software SeqGrapheR (Novak et al. 2010) to 
analyze the TE content of the D. simulans genome directly 
from a uniform random sample of 900 k reads obtained 
from the 2012 genome project (sra:Srx159034) (Hu et al. 
2013). The assembled repetitions were annotated using 
RepeatMasker (version 3.3.0) (Smit AF, Hubley R, Green P, 
unpublished data). 

Data Access 

All the scripts used for our pipeline are available in a git 
repository at: git://dev.prabi.fr/modolo2013. 

Results 

Genome-Wide Detection of Sequences with a Higher 
Nucleotidic Identity than Expected 

Defining the Set of Candidates for HT Detection 

We kept the best local alignments obtained by the Blast search 
of D. simulans against D. melanogaster ior each position in the 
genome of D. melanogaster, thereby taking into account the 
repeated content that is often removed from genome-wide 
alignment (i.e., best global alignment). The cumulative size of 
the filtered sequences decreased with the divergence time 
between a given species and D. melanogaster, which is con- 
sistent with the nucleotidic Blast algorithm (table 1). For 
example, we retrieved approximately 112 Mb of sequences 
between D. melanogaster and D. simulans (divergence time 



of 5.4 ±1.1 Myr), compared with only 13 Mb between 
D. melanogaster and D. virilis (divergence time of 42.9 ± 8.7 
Myr). However, such a trend was not observed for the number 
of filtered sequences, which can be explained by the fragmen- 
tation of the retrieved sequences, which increased with the 
phylogenetic distance (table 1 ). With this set of candidates, we 
used our method to determine whether the observed pairwise 
nucleotidic identity for each of these pairs of sequences was 
higher than expected between the considered species and D. 
melanogaster. 

Assessing the Reference Distribution for 
Nucleotidic Identity 

We computed a reference nucleotidic identity distribution 
with the analysis of the global alignment of the genome of 
D. simulans along the genome of D. melanogaster (fig. 2). This 
distribution accounted for the variations in nucleotidic identity 
along the two genomes, in contrast to the common mutation 
rate of 1.1 ± 0.2 x 10~ 8 mutations per site per year per lin- 
eage for the Drosophila phylogeny that has been computed 
on a limited number of nuclear genes (Tamura et al. 2004). 
Consequently, this mutation rate based on the molecular 
clock hypothesis (Weir and Schluter 2008) may not be repre- 
sentative of the pairwise nucleotidic identity between the 
whole genomes of D. melanogaster and D. simulans 
(Drosophila 1 2 Genomes Consortium 2007) and is not suitable 
for a genome-wide analysis. For the detection of HTs between 
D. melanogaster and D. simulans, we were only interested in 
the expected nucleotide identity corresponding to the accu- 
mulation of mutations between these two species since their 
time of divergence. Thus, we choose the highest mode of 
identity distribution as a reference to compute the unilateral 
P-values of our tests, which quantified the probability of each 
candidate to have a nucleotidic identity exceeding 95.63%, 
while accounting for the size of the alignment (fig. 2). 

Controlling for False Positives in the Context 
of Genomic Dependencies 

As in many genomic studies, the number of statistical tests to 
perform was large (168,325 pairs of sequences for the com- 
parison D. melanogaster vs. D. simulans). If no multiple-testing 
procedure is applied, we can roughly expect to declare an 
average of 10% of the tests (16,832) to be false positives by 
retrieving all the P-values below 0.1, which can be higher than 
the number of true positives (Finner and Roters 2002). By ap- 
plying the standard Benjamini-Hochberg multiple-testing cor- 
rection with an FDR level of 10% (Benjamini and Hochberg 
1995), without taking into account the dependency structure 
between the tests, we only retrieved 605 CDSs, 934 TE inser- 
tions, and 2,345 intergenic DNA fragments. Thus, we used 
our method to assess the probability that each pair of se- 
quences has a higher pairwise identity than expected while 
accounting for its dependency to its neighbors, adjusted to 
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Table 1 

Results of the Filter of the AII-to-AII Nucleotidic Blast between Drosophila melanogaster and the Corresponding Species 



Species 




Sequence Size (kb) 




Number of Seq 


jences 


Divergence Time to 
D. melanogaster (Myr) 


Row 


Filtered 


Significant 3 


Row 


Filtered 


Significant 3 


D. simulans 


550,226 


112,748 


9,012 


4,468,121 


168,325 


11,927 


5.4 


D. sechellia 


1,219,599 


111,909 


5,452 


7,947,377 


170,394 


7,025 


5.4 


D. yakuba 


1,972,352 


91,584 


977 


23,960,790 


239,01 1 


3,185 


12.8 


D. pseudoobscura 


102,146 


22,241 


593 


1,431,447 


213,790 


11,323 


30.0 


D. virilis 


184,640 


13,463 


298 


2,186,411 


117,831 


6,305 


42.0 



Note. — Row, results corresponding to a many-to-many cardinality; filtered, results corresponding to a one-to-one cardinality. 
a Results corresponding to the significative identity-based tests after multiple-testing correction. 



other genomic covariates to increase our sensitivity. Indeed, 
the GC content of the sequences as well as the distance be- 
tween a pair of sequences and the next on a chromosome 
arm and the presence of TEs are likely to be proxies of the 
similarity of a pair of sequence to its neighbors. We also ex- 
pected the recombination rate to be an important factor, but 
no significant correlation was found between the recombina- 
tion data available for the genome of D. melanogaster and the 
P-values of our tests. With this correction applied to the 
168,325 tests, we retrieved 7.3 Mb of sequences, including 
2,651 fragments from CDSs (2.46 Mb), 3,967 fragments from 
insertions of 28 different TE families (201 kb), and a large 
number of intergenic DNA fragments (13,806 sequences 
corresponding to 4.68 Mb), between D. melanogaster and 
D. simulans. 

Distinction between "True" HT Events and Biological 
False Positives 

HT Sequences in the Light of Other Drosophila Species 

We detected a set of sequences with an identity higher than 
expected between the genomes of D. simulans and D. mela- 
nogaster that was not reduced to HT sequences, thus we 
started by retrieving the full-length sequences of each anno- 
tated fragment from the genome of D. melanogaster. Then, 
we discriminated putative HT sequences from the sequences 
displaying a signature of functional constraints. We tested 
whether the d s of the 2,651 detected CDSs was significantly 
lower than expected in the D. melanogaster-D. simulans anal- 
ysis, and we finally retained 26 CDSs. 

To discriminate between conserved and horizontally trans- 
ferred sequences for the full set of detected nonrepeated 
sequence (i.e., both coding and non-CDSs), we used the com- 
parative analysis between D. melanogaster and D. simulans, 
and between D. melanogaster and other Drosophila species. 
This subtraction allowed us to remove approximately 40% of 
the base pairs for the intergenic DNA (thus keeping 2.79 Mb 
of the 4.68 Mb), with a consistently high pairwise identity with 
D. melanogaster in this phylogeny. This result is consistent 
with the results from Casillas et al. (2007) where 38.6% of 



the noncoding DNA in D. melanogaster display the signature 
of functional constraints. We also retained 28 of the 2,651 
CDSs with this second approach. 

The intersection of the results from the d s analysis with the 
ones from this subtraction led to the detection of 1 1 CDSs 
annotated from RNA-Seq data but of unknown function 
(Marygold et al. 2013). These 1 1 CDSs were sparsely distrib- 
uted along all the major chromosome arms of D. melanoga- 
ster and found in clusters of CDSs with significant pairwise 
nucleotide identity but nonsignificant d s . Thus, these 1 1 CDS 
in our results could be biological false positives caused by the 
dependency model used in the multiple-testing correction, 
lowering their probability of being under the null hypothesis 
due to their conserved neighbors, which does not support the 
hypothesis of their HT. For the detected noncoding DNA, we 
were not able to use the D. melanogaster annotations to 
retrieve the full-length sequences of the DNA fragments. 
This class of fragmented DNA, representing 63.91 % of the 
detected DNA in our results, was annotated based on the 
D. melanogaster annotation tracks (supplementary table S1, 
Supplementary Material online) but was only analyzed as 
neighboring sequences of the detected CDSs and TE 
sequences. 

Horizontally Transferred TEs 

For the repeated sequence, we used the activity track to study 
the recent activity of the detected TE family. According to 
the activity track distributions, most of the detected TE families 
in our results presented a recent period of activity in 
D. melanogaster (supplementary figs. S2 and S3, Supplemen- 
tary Material online), with a large number of copies highly 
similar to the most identical pair of copies between the two 
genomes (see e.g., the diver element, fig. 4/4). However, for 
some elements such as the ancient element INE-1, described 
as having invaded the ancestor lineage of D. melanogaster 
and D. simulans (Kapitonov and Jurka 2003), the activity 
track showed a majority of divergent copies with only few 
ones close to the most identical copy between the two ge- 
nomes (fig. 46), as expected by chance for a large number of 
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Fig. 4. — Density distributions of the activity tracks computed with the 2007 version of the genome of Drosophila simulans. The rightmost black bar 
corresponds to the most identical pair of copies between the two genomes, whereas the colored bars represent the number of copies ranked according to 
their similarity to this most identical pair of copies for a given TE. The red bars represent the activity tracks in D. melanogaster, whereas the blue bars represent 
the activity tracks in D. simulans 2007. {A) Example of a TE family presenting a recent period of activity corresponding to a putative HT from D. simulans 
toward D. melanogaster. (B) Example of a TE family with an activity not consistent with an HT between D. simulans and D. melanogaster. (Q Example of a TE 
family with different waves of activity. 
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old and degraded copies. With the activity track ranking the 
TE copies according to the most identical pair of copies be- 
tween the two genomes which can be seen as the last puta- 
tive horizontally transferred copy in the case of an HT, we 
were able to balance the direction of the transfers, which is 
crucial to understand the horizontally transferred TE history 
and dynamics. In the case of a horizontally transferred TE 
from a first species toward a second species, we can expect 
the TE to be present in a small number of highly similar and 
potentially active copies in the first species and in a large 
number of highly similar copies in the second species (supple- 
mentary fig. S3, Supplementary Material online). We observed 
this pattern for elements such as diver, which had a large 
number of copies with an activity track close to 1 in D. mel- 
anogaster and few copies in D. simulans (fig. 44). This pattern 
was consistent with its HT from D. simulans toward D. mela- 
nogaster, even with few copies that were statistically signifi- 
cant in the two genomes. In the genome of D. simulans, most 
of the activity track distributions were bimodal, with few TE 
copies close to 1 and a large number of copies close to 0 
corresponding to old and degraded copies (supplementary 
fig. S3, Supplementary Material online), which was consistent 
with the observations made for some of these TE families in 
the genome of D. simulans (Lerat et al. 201 1). In contrast, in 
D. melanogaster, most of the TE copies had an activity track 
close to 1 , which was representative of young and active TE 
populations. These differences of TE landscape between these 
two species support the hypothesis of multiple horizontally 
transferred TEs from D. simulans (fig. 4/4) and from other 
species (fig. 40 toward D. melanogaster. 

With the 2012 version of the genome of D. simulans, we 
were able to identify 21 TE families with 10 new cases of 
horizontally transferred TEs, which were not previously iden- 
tified as horizontally transferred between these two species 
(supplementary fig. S2, Supplementary Material online). 
However, 1 1 TEs were missing from the 24 horizontally trans- 
ferred TEs previously described by different studies between 
D. simulans and D. melanogaster (de la Chaux and Wagner 
2009; Bartolome et al. 2009; Carareto 2011; Lerat et al. 
2011). From these 11 TEs, the elements were only present 
in a few noncomplete copies in the 2012 version of the 
genome of D. simulans, which explain their absence from 
our results. For the elements F, copia, gypsy5, and gypsylO, 
the TE copies were highly divergent from those present in the 
genome of D. melanogaster and displayed a nonsignificant 
nucleotidic identity. To confirm the absence of the 472 ele- 
ment, known to be active in some populations of D. simulans 
(Vieira and Biemont 1997), we performed a de novo assembly 
of the TEs directly from the reads of the 2012 D. simulans 
genome project. The reads corresponding to this 7,566 bp 
element represented 50 kb of the 137.8 Mb genome of 
D. simulans, with the majority of the reads matching the 
long terminal repeats and few reads mapping within the ele- 
ment, which was concordant with the 2012 assembly. 



Therefore, the absence of these 1 1 horizontally transferred 
TEs from our results was likely the result of their absence 
from the assembled strain in the 2012 version of the 
genome of D. simulans rather than a lack of sensitivity of 
our method. Using the genome of D. simulans from the 12 
Drosophila genomes project (Drosophila 12 Genomes 
Consortium 2007) (supplementary fig. S3, Supplementary 
Material online), we were able to recover in one analysis the 
24 HTs previously described in the literature, including the 1 1 
families missing from our analysis with the D. simulans 
genome of 201 2. The 1 0 new and the 24 previously described 
TEs all presented activity track distributions consistent with the 
after effect of a horizontally transferred TE from D. simulans 
toward D. melanogaster (supplementary figs. S2 and S3 and 
table S2, Supplementary Material online). Thus, given the 
number of horizontally transferred TEs detected between 
D. melanogaster and D. simulans in the short time since 
their divergence, a parsimonious hypothesis could be the 
introgression of one or more fragments of DNA containing 
different TEs instead of multiple independent HTs. 

Introgression versus Multiple HT Events 

To obtain a broader view of the HTs between D. melanogaster 
and D. simulans, and discriminate between introgression and 
multiple independent HTs, we manually inspected the 1 1 
CDSs, the TE insertions from the 21 families left and the 
10,232 fragments of noncoding DNA in the final results 
along each chromosome arm of D. melanogaster and with 
the 2012 genome of D. simulans. In the case of introgression, 
we expected to observe the simultaneous transfer of these 
three types of sequences in one large DNA fragment. 
However, we found no sequence containing three or even 
two of these different types of sequences in the final results. 
This absence of completely introgressed fragments could be a 
consequence of the fragmentation of the detected sequences 
between the two genomes. However, we also did not find any 
obvious clusters of these different types of DNA along the 
chromosome arms of D. melanogaster. Overall, the types of 
detected sequences in our study support the prevalence of TEs 
and noncoding DNA in HTs between these two species. 
However, the informations contained in the genome of the 
sequenced individuals are not sufficient to support the 
hypothesis of multiple independent horizontally transferred 
TEs toward D. melanogaster rather than introgression events. 

To better understand the horizontally transferred TEs in- 
volving D. melanogaster, we performed the same horizontally 
transferred TE analysis with the data from our comparison 
with the four other Drosophila species (D. sechellia, 
D. yakuba, D. pseudoobscura, and D. virilis) (supplementary 
figs. S4-S7 and table S2, Supplementary Material online). We 
detected numerous horizontally transferred TEs in this re- 
stricted window of time starting 5.4 ±1.1 Ma, corresponding 
to the expected identity threshold between D. melanogaster 
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and D. simulans. The comparison between D. melanogaster 
and D. sechellia provided evidence for HTs of 21 elements 
between these two species (supplementary fig. S4, Supple- 
mentary Material online). Drosophila sechellia is the only spe- 
cies with a divergence time to the ancestor that it shares with 
D. simulans within the time window of our analysis, so for this 
species we can discriminate between horizontally transferred 
TEs involving D. melanogaster and the ancestor of D. simulans 
and D. sechellia, and horizontally transferred TEs involving 
D. melanogaster and D. sechellia (fig. 3). For the element 
(with an activity track nonconsistent with an HT in the 2007 
version of the genome of D. simulans and absent from the 
201 2 version), the results rather indicate recent activity in both 
species, which suggest the existence of a third donor species, 
such as another D. simulans strain than those sequenced in 
2007. We also observed the same pattern for 17 elements 
between D. melanogaster and D. yakuba (supplementary 
fig. S5, Supplementary Material online). However, we did 
not find any clear evidence of recent horizontally transferred 
TEs or a burst of transposition in the analysis of the TEs 
detected between D. melanogaster and D. pseudoobscura, 
or between D. melanogaster and D. virilis, which could be 
explained by the degree of fragmentation of the correspond- 
ing sets of sequences (supplementary figs. S6 and S7, 
Supplementary Material online). Finally, recent activity of the 
Roo element was detected in D. melanogaster, D. simulans, 
D. sechellia, and D. yakuba, which could be the result of in- 
dependent HTs of this element toward these four species after 
their divergence (de la Chaux and Wagner 2009). 

Discussion 

In eukaryotes, the study of horizontally transferred sequences 
is confined to CDSs and often focuses on specific TEs. Thus, 
there are two systematic biases in the detection of HTs in 
eukaryotes: the candidate HT must be known and must 
have a coding capacity (Keeling and Palmer 2008). We pro- 
pose a new genome-wide approach that aims to bypass these 
biases inherent to sequence-specific approaches by consider- 
ing all the best local alignments of one genome to another as 
possible horizontally transferred sequences. Then, we test 
each of these sequences to retrieve those with a higher 
nucleotidic identity than expected between the two species 
while accounting for the multiplicity of the tests and their 
dependency structure throughout the target genome. We de- 
tected 2,651 CDSs, 3,967 insertions from 28 different TE fam- 
ilies, and a large number of intergenic DNA fragments 
(13,806) more identical than expected from the 4,468,121 
pairs of sequences identifiable between D. melanogaster 
and the 2012 version of the genome of D. simulans. Finally, 
we discriminated between spurious HT detection and putative 
HTs in our results with two novel validation procedures for 
genome-wide HT detection. And after manual inspection of 
the results, we retained 21 TE families as horizontally 



transferred between these two species, validating the preva- 
lence of TE sequences in HTs between these two species with- 
out detection bias toward this type of sequence. 

Genome-Wide Identification of Putative Horizontally 
Transferred Sequences 

Previous genome-wide approaches used a wide range of pro- 
cedures to infer sequences more identical than expected given 
the phylogeny of the species to detect HTs in eukaryotes 
(Loreto et al. 2008; Wallau et al. 2012). However, none of 
these procedures relied on a statistical testing framework to 
validate their sensitivity and specificity. This explains why se- 
quence-specific approaches are still used: their particular reli- 
ability despite the limited set of sequences considered (Wallau 
etal. 2012). The collegia! tests for the identity-based detection 
of horizontal transferred sequences in eukaryotes rely on the 
synonymous substitution rate, often in the form for a codon- 
based Z-test (Pace et al. 2008; Gilbert and Cordaux 2013). In 
our study, the set of candidates sequences was not restricted 
to the small coding portion of eukaryote genomes, and this 
justified the use of a binomial test to retrieve the sequences 
with a higher pairwise nucleotidic identity than expected 
between two species without any codon information while 
accounting for the size of each candidate. This simple model 
for codon substitution is sensitive enough to detect recent HTs 
for which we can expect a small saturation between se- 
quences. The saturation corresponds to the occurrences of 
multiple mutations at a single nucleotide (or site), which 
leads to an underestimation of the nucleotidic divergence be- 
tween two sequences because we can only observe the last 
mutation in the case of multiple mutations per site. A pair of 
sequences with saturation is expected to have more single 
mutations per site than multiple mutations per site. Thus, 
the complex cases, ill-defined by the model, will also corre- 
spond to the sequences in the "uninteresting" side of our 
unilateral hypothesis and will be correctly assigned to the set 
of nonsignificant sequences. 

In genome-wide analyses, we often face multiple-testing 
issues, and our results underscore the importance of working 
with a well-defined statistical framework to control the 
number of incorrect detections and increase the power of 
the study. We also took advantage of the fact that when 
comparing two genomes, we always have a genome of 
better quality to map the detected candidate sequences, to 
greatly reduce the dimensionality of the data to be analyzed, 
thus increasing the power of our study (Storey and Tibshirani 
2003). For our analysis, the now standard Benjamini- 
Hochberg FDR (Benjamini and Yekutieli 2001) procedure 
had a too low specificity to produce relevant results. This 
was caused by the dependency between the tests in our anal- 
ysis, which was taken into account with the US framework to 
increase the specificity of our approach (Sun and Tony Cai 
2009). Modeling this dependency between each pair of 
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candidates along the chromosome arms of D. melanogaster 
with a homogeneous Markov model (HMM) was not suffi- 
cient to retrieve all the horizontally transferred TEs described in 
the literature with the genome of D. simulans from the 1 2 
Drosophila project (Drosophila 12 Genomes Consortium 

2007) . For our purposes, an HMM would have tended to ho- 
mogenize the US statistics according to the information pro- 
vided by the adjacent without taking into account any 
information about the type of sequences or the distance be- 
tween them (i.e., the nonsignificant pairs of sequences sur- 
rounding a TE copy), which can explain these missing 
horizontally transferred TEs. With an NHMM, we were able 
to enrich the standard Markovian dependency according to 
covariates, such as the distance between the statistical tests 
along the genome and the presence of TEs, and to detect the 
24 horizontally transferred TEs described in the literature 
(Carareto 201 1) with an FDR level of 10% and the 2007 ver- 
sion of the genome of D. simulans, in addition to 1 0 new ones 
with the 2012 version of the genome (supplementary table 
S2, Supplementary Material online). 

Our approach can be applied to any pair of sequenced 
species in which one species has an assembled genome, 
into which candidate sequences will be placed, to model the 
dependency structure between the tests with a NHMM. The 
specificity of this method is high enough to detect sequences 
more identical than expected between closely related species 
while controlling for the FDR in the results. This procedure 
could also be used with any other unilateral tests for different 
biological problems or to model the nucleotidic differences in 
ancient HTs or between more divergent species with a greater 
prevalence of saturation. 

Two New Methods to Confirm HTs in Genome-Wide 
Studies 

Most of the methods for HT detection in eukaryotes use se- 
quence-specific approaches and rely on strong d s evidence to 
infer putative HTs (Bartolome et al. 2009; Lerat et al. 201 1). In 
the remaining studies, the candidate sequences generally in- 
volve distantly related species and recent HT events where the 
identity line of evidence can be self-sufficient (Loreto et al. 

2008) , for example, the case of the TEs SPIN and OC7 
(Gilbert et al. 2010, 2013). This can also be the case for 
recent HTs, such as the well-known example of the P element 
transferred from D. willistoni to D. melanogaster less than 1 00 
years ago, for which the nucleotidic identity is almost of 1 00% 
between the two species (Daniels et al. 1990). We were able 
to retrieve sequences with an identity percentage higher than 
99% between D. melanogaster and D. yakuba for the ele- 
ments Doc, jockey, and transib3, which was unexpected and 
could be sufficient to infer their HT. However, the number of 
obvious cases was small, and we needed to confirm the other 
HTs by other lines of evidence. 



When studying the pattern of sequence divergence 
between genomes to infer HTs, we can encounter a large 
number of confounding factors that need to be checked 
(Siepel et al. 2005; Pollard et al. 2010). These factors range 
from natural turnover (gain or loss of functional elements) to 
the effect of purifying and positive selection, which can act on 
entire sequences or on parts of sequences, canceling the 
effect of divergence. For the study of HTs, we can add to 
this list the effect of different evolutionary rates for the se- 
quence under consideration or the effect of stochastic losses 
in the phylogeny of the candidate sequence(s) (Loreto et al. 
2008). Moreover, we have to rely on orthologs and sequence 
identification, which is nontrivial and can lead to numerous 
false positives (Gronau et al. 2013). The possibility of mis- 
placed DNA sequences in the genomic database, polymerase 
chain reaction mispriming, contamination, incomplete se- 
quence data, and poorly rooted trees can also be technical 
sources of errors for HT detection (Lisch 2008). Therefore, to 
differentiate between putative HT events and the possibility 
of vertical transmission, we need to investigate other lines of 
evidence (Loreto et al. 2008; Gilbert et al. 2010). In genome- 
wide studies of HT, in contrast to sequence-specific 
approaches, all the candidate sequences are not assumed to 
have been horizontally transferred from one species to the 
other, and the procedures need to include validation steps 
to produce sound results. 

Validation of the Nonrepeated Content 

Our approach follows an identity-based line of evidence to 
detect HTs, so we would need phylogenetic clues to validate 
them. In the case of D. melanogaster and D. simulans, which 
are almost at a terminal node of the Drosophila phylogenic 
tree, phylogenetic incongruences would mostly consist of 
nonsignificant differences in branch lengths compared with 
those expected. Even if incomplete lineage sorting could 
remain a problem, for a sequence-specific identity-based 
approach, the validation procedures mainly consist of showing 
evidence that the high observed nucleotidic identity is not 
the result of other mechanisms than HT, such as purifying 
selection or a mutational cold spot (Pace et al. 2008; Casillas 
et al. 2007). When dealing with genome-wide data, tools 
such as scone or the ones from the phast package can 
produce conservation tracks from multiple genome-alignment 
between different species (Asthana et al. 2007; Hubisz 
et al. 2011). However, these conservation tracks consist 
of quantitative scores to measure the departure from neu- 
trality for each nucleotide, and these scores are difficult to 
incorporate into a statistical test to determine whether a 
given detected fragment is conserved or horizontally 
transferred. 

We thus developed a more conservative approach that also 
accounted for non-CDSs, by subtracting the results of the 
D. melanogaster-D. simulans comparison from those retrieved 
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for the comparison between D. melanogaster and other 
Drosophila species (D. sechellia, D. yakuba, D. pseudoobscura, 
and D. virilis). With this comparative analysis, we discriminated 
between putative horizontally transferred sequences and se- 
quences under purifying selection that are expected to be 
conserved across the phylogeny genome-wide for coding 
and noncoding DNA. The use of those four other species 
strengthened our results by preventing the detection of sto- 
chastic loss or ancestral polymorphisms. These two mecha- 
nisms could lead to the detection of conserved sequences 
between D. melanogaster and D. simulans that are absent 
from a third species, which is unlikely to occur simultaneously 
in the five analyzed species. 

Finally, as this line of evidence is easily accessible and should 
always be considered for the study of the HT of CDSs, we also 
checked for CDS d s values lower than expected given the time 
of divergence of the considered species. Overall, our results 
show an absence of HTs involving CDS between Drosophila 
species (Schaack et al. 201 0), which is not caused by detection 
bias toward these types of sequences. Because this validation 
procedure is restricted to the nonrepeated content of our re- 
sults, we also developed a second validation procedure to 
assess the TEs identified. 

Validation of Horizontally Transferred TEs 

To identify horizontally transferred TEs among the set of TEs 
with an identity higher than expected between D. melanoga- 
ster and D. simulans, we analyzed their recent dynamics since 
their last putative HT (Dias and Carareto 2012). The TE dy- 
namics and maintenance in the host genome can be described 
as a birth-and-death processes (Schaack et al. 2010; Le Rouzic 
et al. 2013). The death of a TE corresponds to the inactivation 
of all its copies by the host defense mechanisms or the accu- 
mulation of disabling mutations (Jurka et al. 2012). On the 
other hand, the birth of a TE corresponds to an active copy 
colonizing a novel host devoid of specific transposition con- 
trols against this TE, which immediately leads to the burst of 
transposition of the founder copy in the new genome (Le 
Rouzic and Capy 2005; Naito et al. 2009). Bursts of transpo- 
sition have been recorded for different TEs in numerous 
Drosophila species (Garcia Guerreiro 2012) and can be easily 
identifiable because all the resulting TE copies are almost iden- 
tical to each other. Afterward, most of the copies accumulate 
stochastic mutations and are lost over time by attrition, except 
for a minority of copies that can become exapted and can be 
identified as DNA segments conserved across species 
(Margulies et al. 2003; Siepel et al. 2005; Pace et al. 2008). 
Because TEs are likely to evolve neutrally after their insertion, 
we could use the neutral rate of substitution to compute the 
timing of a burst of transposition by calculating the pairwise 
divergence between all the TE copies and their consensus as 
an approximation of the founder copy as described in the 
literature (Pace and Feschotte 2007; Schaack et al. 2010; Le 



Rouzic et al. 2013). However, the consensus is not always a 
good approximation of the ancestral copy. Thus, instead of 
studying the complete history of a TE family with a consensus 
approach, our method focuses on the period of time sur- 
rounding its last putative HT between the considered species 
and ignores the events older than the divergence time be- 
tween D. melanogaster and D. simulans. Thus, this change 
in the time scale provided us with a better temporal resolution 
for the study of the last bursts of transposition. In D. melano- 
gaster, where the TE activity was recent (Bowen and 
McDonald 2001 ; Lerat et al. 2003), we were not able to clearly 
discriminate between the different activity periods of the TE 
families with an approach based on an estimated neutral mu- 
tation rate between all the copies of a TE family and their 
consensus sequences (Ray et al. 2008). Moreover, for the 
TEs with different waves of activity, such as the element tran- 
sib3 (fig. 4Q in the studied species, a consensus would corre- 
spond to a hypothetical copy dated in the middle of the waves 
of activity rather than to the ancestral copy. Our approach 
solves these drawbacks of consensus-based TE analysis and 
accounts for highly variable lengths of copies between TE 
families. 

Another important point concerning HTs is to determine 
the direction of these transfers. In the cases of horizontally 
transferred TEs, we could expect a species with a high 
number of TE copies to have a higher probability to horizon- 
tally transmit one of its copies to another species, resulting in 
numerous identical copies in one species and few in the other. 
For this scenario to be valid, the transferred TEs would need to 
be almost instantly regulated in the receiver species to stay at 
a low copy number or for the receiver species to be se- 
quenced before their burst of transposition. For both cases, 
these TE insertions would not have a high frequency in the 
species and would most likely not be observed in the se- 
quenced strains. In an opposite scenario (a horizontally trans- 
ferred TE from a species with few putatively active but 
controlled TE copies), a TE is transferred to a species where 
this TE is unknown for the host regulation system, which 
would lead to a burst of transposition and a quick fixation 
of this TE in the receiver species. Consequently, we are more 
likely to observe the results of this second scenario in the 
sequenced individuals, and we can use it to decipher the di- 
rection of detected horizontally transferred TEs (Dias and 
Carareto 2012). 

Overall, our results show that different waves of activity 
seem to have occurred for different TE families and that 
their dynamics can be used to describe the numerous horizon- 
tally transferred TEs between D. melanogaster and D. simu- 
lans. After a horizontally transferred TE and a burst of 
transposition, we expect to observe a unique wave of activity 
before the control of the element, so the presence of other 
waves seems to be indicative of a complex history of the TE 
dynamics in Drosophila. 
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Drosophila melanogaster as a Target of Multiple 
Independent Horizontally Transferred TE Events 

Exchange of TEs with D. simulans 

In regard to the number of horizontally transferred TEs 
that have been detected, a parsimonious hypothesis would 
be their simultaneous transfer by introgression instead of in- 
dependent HTs. Thus, we can wonder why no traces of intro- 
gression were detected between D. melanogaster and 
D. simulans, when hybrids are known to have been possible 
between these two species (Sawamura et al. 2004; Barbash 
2010). A first genomic explanation could be that due to mu- 
tations and recombinations, the signal of an introgressed frag- 
ment of DNA has been lost over time. In this case, we can 
wonder why this DNA fragment would have undergone 
such high recombination and mutation rates, when most 
of the DNA is still identifiable between D. melanogaster and 
D. simulans. 

As "nothing in evolution makes sense except in light of 
population genetics" (Lynch 2007), we can try to answer 
this question at a population level. For TEs, many steps are 
necessary for an HT to be successful (Le Rouzic and Capy 
2005). After passing through the new host barriers, the TE 
must transfer itself into the germ line to be transmitted to the 
descendants. Then, the TE needs to have a sufficient transpo- 
sition rate to propagate into the host genome and to increase 
in frequency in the species by vertical transmission. For TEs, 
which are able to actively colonize genomes, most of the TE 
insertions in natural populations are absent from the se- 
quenced genome, as shown by the study of 1 13 D. melano- 
gaster strains isolated from natural population (Kofler et al. 
2012). 

In the case of an introgression, all the cells in the progeny of 
the backcross with the hybrid will have a copy of the intro- 
gressed DNA fragment, so the first step of contaminating the 
germ line is always successful. Afterward, this introgressed 
DNA fragment has to increase in frequency in the species to 
be likely to be observed in the individuals actually sequenced. 
In contrast to the active TE copies, the introgressed fragment 
cannot actively replicate itself in the new genome, and its 
probability of fixation is simply its frequency in the population, 
at least in diploid organisms with a large effective population 
size, such as D. melanogaster (Nolte and Schlotterer 2008). As 
a result, the frequency of this introgressed fragment would be 
almost null in comparison to the effective population size of 
D. melanogaster, and even with the carrier subpopulation hy- 
pothesis (Jurka et al. 201 1), where the population is divided 
into demes in each of which we can observe an effect of 
genetic drift that favors the fixation of low-frequency alleles, 
the introgressed fragment would have a low probability to be 
transmitted to the other demes and to be fixed in the species 
(Ghosh et al. 2012). Therefore, we would need to use 
D. melanogaster and D. simulans population-genetic data to 
be able to detect any traces of introgression events, as in the 



recent study of introgression events between D. simulans and 
D. sechellia from Brand et al. (2013). 

This population-genetic aspect of the genomic data needs 
to be taken into account, as it can explain other aspects of our 
TE-based results. For example, the differences in the detection 
of horizontally transferred TEs between D. melanogaster and 
D. simulans found between the 2012 version of the D. simu- 
lans genomes sequenced from one strain (Hu et al. 201 3) and 
the version from the 12 Drosophila genomes project se- 
quenced from five different strains (Drosophila 12 Genomes 
Consortium 2007) can be explained by the variability of TE 
insertions between the populations of D. simulans (Vieira and 
Biemont 2004). 

With Other Drosophila Species 

Overall, the extensive evidence of horizontally transferred TEs 
detected in D. melanogaster seems to indicate that the fixa- 
tion of new TEs could be facilitated in this genome. The timing 
of most HTs involving D. melanogaster was estimated be- 
tween 1.4 and 2.3 Ma, before the worldwide expansion of 
D. melanogaster and D. simulans that happened 1 5,000 years 
ago (Stephan and Li 2007; Carareto 201 1). The melanogaster 
subgroup is endemic to Afrotropical regions, with the proto- 
melanogaster founder dated between 1 7 and 20 Ma from the 
oriental region of Africa (Lachaise et al. 1988). Thus, a parsi- 
monious hypothesis for the numerous horizontally transferred 
TEs detected among D. melanogaster, D. simulans, D. sechel- 
lia, and D. yakuba would place them at a time when these 
species were all living in Africa, before the worldwide expan- 
sion of D. melanogaster and D. simulans. In this scenario, there 
would have been fewer geographical barriers to hamper the 
fixation of horizontally transferred TEs into sympatric popula- 
tions with a smaller repartition area than the worldwide pop- 
ulations of D. melanogaster. The arrival of these new TE copies 
in the genome D. melanogaster may have been a springboard 
for the worldwide expansion of this species, as the load of TEs 
can be correlated with the colonization of new territory (Vieira 
et al. 1999, 2002). In contrast, a stronger population structure 
in D. simulans (Mousset and Derome 2004) could explain the 
polymorphisms of TE insertion that have resulted in different 
TE loads between populations (Vieira and Biemont 2004) and 
that may have independently favored the worldwide expan- 
sion of this species, even if in both cases the cause of such a 
mechanism is not yet understood. 

Conclusions 

We have developed a novel approach for the genome-wide 
detection of all putative HT sequences independently of their 
coding capability between two genomes. Our method relies 
on a well-defined testing framework to approach this 
genome-wide problem as a multiple-testing problem. We suc- 
cessfully applied this method between the genomes of 
D. melanogaster and D. simulans, underscoring the sensitivity 
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of our approach to detect HTs between closely related species. 
Like previous studies of HTs in eukaryotes, we validated these 
results with other lines of evidence. We also proposed two 
novel approaches to remove bias due to the detection of con- 
served sequences, by a comparative analysis with phylogenet- 
ically related species in the case of CDS and non-CDSs and by 
an analysis of their recent activity in the case of TEs. After 
these validation steps, we retrieved all the horizontally trans- 
ferred TEs previously described in different studies (see 
Carareto [2011] for a review) and very few spurious CDS, 
attesting to the sensitivity and the specificity of our approach. 

By a manual analysis of our results along each chromosome 
arm of D. melanogaster, we did not detect any trace of intro- 
gression between D. melanogaster and D. simulans, even if 
this does not completely rule out this hypothesis. We also 
detected a large number of horizontally transferred TEs involv- 
ing D. melanogaster and other Drosophila species with our 
assessment steps, bringing to light a small portion of the net- 
work of horizontally transferred TEs in this phylogeny. This 
large number of HTs for different TE families also supports 
the model of birth and death, where HT events are a vital 
part of the TE life cycle that prevents their extinction 
(Schaack et al. 2010). We are just beginning to understand 
the complex horizontally transferred TE network in eukary- 
otes, and our approach could be applied to any pair of se- 
quenced species to increase our knowledge of the dynamics 
of these sequences, which seem to jump both within and 
between species. 

Supplementary Material 

Supplementary figure S1-S7 and tables S1 and S2 are avail- 
able at Genome Biology and Evolution online (http://www. 
gbe.oxfordjournals.org/). 
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